
MoE multi-chip experts example#720

Open
puddingfjz wants to merge 2 commits into hw-native-sys:main from puddingfjz:moe_distributed_demo

Conversation

@puddingfjz puddingfjz commented May 8, 2026

Summary

This PR adds a focused L3 example for a distributed MoE-style workflow with
one expert per chip.

The example demonstrates:

  • L3 multi-chip worker setup with HCCL bootstrap windows
  • Cross-rank dispatch of per-expert token slices
  • Per-rank expert compute on the dispatched recv buffer
  • Cross-rank combine back to each source rank's output tensor
  • A pytest wrapper for running the end-to-end hardware case

The pipeline is intentionally small (NUM_TOKENS = 10, HIDDEN_DIM = 16,
COUNT = 4) so the data movement is easy to inspect while still exercising
dispatch, compute, and combine across chips.
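The dispatch → compute → combine round trip can be simulated in a single process; this is a NumPy sketch of the data movement only (the routing table and buffer names here are illustrative, not the example's actual HCCL implementation):

```python
import numpy as np

NUM_TOKENS = 10   # tokens per rank, from the PR description
HIDDEN_DIM = 16
NUM_RANKS = 4     # one expert per chip

rng = np.random.default_rng(0)
tokens = rng.standard_normal((NUM_RANKS, NUM_TOKENS, HIDDEN_DIM))
# Hypothetical routing: each token is assigned to exactly one expert rank.
routing = rng.integers(0, NUM_RANKS, size=(NUM_RANKS, NUM_TOKENS))

# Dispatch (all-to-all): collect every token routed to expert e onto rank e,
# remembering each token's source slot for the combine step.
recv = [[] for _ in range(NUM_RANKS)]
origin = [[] for _ in range(NUM_RANKS)]
for src in range(NUM_RANKS):
    for t in range(NUM_TOKENS):
        e = routing[src, t]
        recv[e].append(tokens[src, t])
        origin[e].append((src, t))

# Compute: the example's placeholder expert is a simple +1.0.
out = [[tok + 1.0 for tok in bucket] for bucket in recv]

# Combine (all-to-all): scatter each result back to its source rank's slot.
result = np.empty_like(tokens)
for e in range(NUM_RANKS):
    for (src, t), val in zip(origin[e], out[e]):
        result[src, t] = val

# Every token came back to its own slot, transformed once.
assert np.allclose(result, tokens + 1.0)
```

Because the placeholder expert is a pure elementwise +1.0, the end-to-end result is checkable against `tokens + 1.0` regardless of how tokens were routed, which is what makes the tiny sizes easy to inspect.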

Testing

  • conda run -n simpler_issue python3 -m py_compile examples/workers/l3/moe_multi_chip_experts/main.py examples/workers/l3/moe_multi_chip_experts/test_moe_multi_chip_experts.py
  • Hardware pytest through task-submit on devices 9,10: 1 passed in 8.50s

Test Configuration:
- 4 experts (one per chip)
- 10 tokens in context
- 4 tokens processed per expert
- Hidden dimension: 16

IMPORTANT: Current implementation tests DATA FLOW only, not actual MoE computation:
- Compute phase is a simple +1.0 operation, not expert network computation
- Focus is on verifying correct token routing and result gathering
- Can be extended to add real expert models later
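A minimal sketch of what "extend to real expert models later" could look like: the +1.0 placeholder and a hypothetical two-layer FFN expert share the same shape contract, so they are drop-in replacements for one another (function names and weights here are illustrative):

```python
import numpy as np

def placeholder_expert(x):
    # The PR's current compute phase: exercises routing, not real math.
    return x + 1.0

def ffn_expert(x, w1, w2):
    # Hypothetical replacement: a ReLU FFN expert with the same in/out shape.
    return np.maximum(x @ w1, 0.0) @ w2

HIDDEN_DIM = 16
rng = np.random.default_rng(1)
x = rng.standard_normal((4, HIDDEN_DIM))   # 4 tokens dispatched to this expert
w1 = rng.standard_normal((HIDDEN_DIM, 4 * HIDDEN_DIM))
w2 = rng.standard_normal((4 * HIDDEN_DIM, HIDDEN_DIM))

# Both experts map (tokens, HIDDEN_DIM) -> (tokens, HIDDEN_DIM),
# so swapping them leaves dispatch and combine untouched.
assert placeholder_expert(x).shape == x.shape
assert ffn_expert(x, w1, w2).shape == x.shape
```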

Core Components:
- Kernels: dispatch (all-to-all), compute (+1.0), combine (all-to-all)
- Orchestration: end2end, dispatch-only, combine-only, dispatch+compute
- Unit Tests: test_dispatch_only, test_combine_only, test_dispatch_compute
- E2E Test: test_end2end with unique value tracing
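The unique-value-tracing idea behind test_end2end can be sketched as a pytest-style check (the test and helper names below are illustrative, not the PR's actual API): stamp every (rank, token) slot with a distinct value, so any misrouted or stale token shows up as a mismatch after combine.

```python
import numpy as np

def run_end2end_sim(tokens):
    # Stand-in for the hardware end2end path: dispatch -> +1.0 -> combine.
    # On hardware this would round-trip tokens across chips.
    return tokens + 1.0

def test_end2end_unique_value_tracing():
    ranks, num_tokens, hidden = 4, 10, 16
    # One unique stamp per (rank, token) slot, broadcast along hidden dim.
    stamps = np.arange(ranks * num_tokens, dtype=float)
    tokens = stamps.reshape(ranks, num_tokens, 1).repeat(hidden, axis=2)
    out = run_end2end_sim(tokens)
    # Each slot must return exactly its own stamp + 1.0; a swap or a
    # stale buffer read would produce the wrong stamp somewhere.
    np.testing.assert_allclose(out, tokens + 1.0)
```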

KEY DESIGN: Use INDEPENDENT scratch_test buffer for combine phase
- Problem: Reusing scratch caused combine to read stale dispatch data
- Solution: Dispatch+Compute use scratch, Combine uses scratch_test
- Prevents corruption when combine's stage-in doesn't fully overwrite
  dispatch's data (it writes only 4 token rows, with a stride laid out for
  NUM_TOKENS = 10)
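The stale-data hazard can be demonstrated with a small buffer model (a sketch; the buffer names mirror the PR's scratch/scratch_test but the layout details are assumptions):

```python
import numpy as np

NUM_TOKENS, HIDDEN_DIM, PER_EXPERT = 10, 16, 4

# Dispatch+compute fill the whole shared scratch region with staging data.
scratch = np.full((NUM_TOKENS, HIDDEN_DIM), -1.0)  # -1.0 = dispatch leftovers

# Combine's stage-in writes only PER_EXPERT token rows into a region
# laid out for NUM_TOKENS rows, so rows 4..9 are never overwritten.
combine_in = np.ones((PER_EXPERT, HIDDEN_DIM))
scratch[:PER_EXPERT] = combine_in
assert (scratch[PER_EXPERT:] == -1.0).all()  # stale dispatch data survives

# The PR's fix: give combine an independent buffer so any row it reads
# beyond the staged tokens holds known-clean data, not dispatch leftovers.
scratch_test = np.zeros((NUM_TOKENS, HIDDEN_DIM))
scratch_test[:PER_EXPERT] = combine_in
assert (scratch_test[PER_EXPERT:] == 0.0).all()
```

The alternative of zeroing the shared scratch between phases would also work, but a separate buffer keeps the phases independent at the cost of a small, fixed amount of extra memory.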

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@puddingfjz puddingfjz force-pushed the moe_distributed_demo branch 6 times, most recently from 94fe535 to 1fb325a on May 8, 2026 08:16
- Keep the example focused on the end-to-end dispatch, compute, and combine path
- Remove obsolete debug docs, partial tests, and unused kernel variants
- Align README, test naming, and scratch buffer handling with the current two-chip hardware test
@puddingfjz puddingfjz force-pushed the moe_distributed_demo branch from 1fb325a to d47f536 on May 8, 2026 09:27